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Abstract 

Bayesian methods are appealing in their flexibility in modeling complex data and ability in capturing 
uncertainty in parameters. However, when Bayes’ rule does not result in tractable closed-form, most 
approximate inference algorithms lack either scalability or rigorous guarantees. To tackle this challenge, 
we propose a simple yet provable algorithm, Particle Mirror Descent (PMD), to iteratively approximate 
the posterior density. PMD is inspired by stochastic functional mirror descent where one descends in the 
density space using a small batch of data points at each iteration, and by particle filtering where one 
uses samples to approximate a function. We prove result of the first kind that, with m particles, PMD 
provides a posterior density estimator that converges in terms of AL-divergence to the true posterior 
in rate We demonstrate competitive empirical performances of PMD compared to several 

approximate inference algorithms in mixture models, logistic regression, sparse Gaussian processes and 
latent Dirichlet allocation on large scale datasets. 
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1 Introduction 


Bayesian methods are attractive because of their ability in modeling complex data and capturing uncer¬ 
tainty in parameters. The crux of Bayesian inference is to compute the posterior distribution, p(9\X) oc 
p(9) Y\ n= xP{x n \9), of a parameter 9 G R d given a set of N data points X = {x n } n=1 from R D , with a prior 
distribution p{9) and a model of data likelihood p(x\9). For many non-trivial models from real-world applica¬ 
tions, the prior might not be conjugate to the likelihood or might contain hierarchical structure. Therefore, 
computing the posterior often results in intractable integration and poses computational challenges. Typi¬ 
cally, one resorts to approximate inference such as sampling, e.g ., MCMC [21 j and SMC m, or variational 
inference [531 HD] , 

Two longstanding challenges in approximate Bayesian inference are i) provable convergence and ii) data- 
intensive computation at each iteration. MCMC is a general algorithm known to generate samples from 
distribution that converges to the true posterior. However, in order to generate a single sample at every 
iteration, it requires a complete scan of the dataset and evaluation of the likelihood at each data point, which 
is computationally expensive. To address this issue, approximate sampling algorithms have been proposed 
which use only a small batch of data points at each iteration [ e.a. 151121 1351 126j . Chopin [5], Balakrishnan 
and Madigan [3j extend the sequential Monte Carlo (SMC) to Bayesian inference on static models. However, 
these algorithms rely on Gaussian distribution or kernel density estimator as transition kernel for efficiency, 
which breaks down the convergence guarantee of SMC. On the other hand, the stochastic Langevin dy¬ 
namics algorithm (SGLD) [35] and its derivatives [5J [7, T3] combine ideas from stochastic optimization and 
Hamiltonian Monte Carlo, and are proven to converge in terms of integral approximation, as recently shown 
in [55] 12(jj. Still, it is unclear whether the dependent samples generated reflects convergence to the true pos¬ 
terior. FireflyMC [26) . introduces auxiliary variables to switch on and off data points to save computation 
for likelihood evaluations, but this algorithm requires the knowledge of lower bounds of likelihood that is 
model-specific and may be hard to calculate. 

In another line of research, the variational inference algorithms [53] [35] i55] attempt to approximate 
the entire posterior density by optimizing information divergence [55]. The recent derivatives m avoid 
examination of all the data in each update. However, the major issue for these algorithms is the absence of 
theoretical guarantees. This is due largely to the fact that variational inference algorithms typically choose 
a parametric family to approximate the posterior density, which can be far from the true posterior, and 
require to solve a highly non-convex optimization problem. In most cases, these algorithms optimize over 
simple exponential family for tractability. More flexible variational families have been explored but largely 
restricted to mixture models [23l 115] . In these cases, it is often difficult to quantify the approximation and 
optimization error at each iteration, and analyze how the error accumulates across the iterations. Therefore, 
a provably convergent variational inference algorithm is still needed. 

In this paper, we present such a simple and provable nonparametric inference algorithm, Particle Mirror 
Descent (PMD), to iteratively approximate the posterior density. PMD relies on the connection that Bayes’ 
rule can be expressed as the solution to a convex optimization problem over the density space 03] (33] |35|. 
However, directly solving the optimization will lead to both computational and representational issues: one 
scan over the entire dataset at each iteration is needed, and the exact function update has no closed-form. 
To address these issues, we draw inspiration from two sources: (i) stochastic mirror descent, where one can 
instead descend in the density space using a small batch of data points at each iteration; and (ii) particle 
filtering and kernel density estimation, where one can maintain a tractable approximate representation of 
the density using samples. In summary, PMD possesses a number of desiderata: 

Simplicity. PMD applies to many probabilistic models, even with non-conjugate priors. The algorithm is 
summarized in just a few lines of codes, and only requires the value of likelihood and prior, unlike other 
approximate inference techniques @5] HH [33] [H e.g .], which typically require their first and/or second-order 
derivatives. 

Flexibility. Different from other variational inference algorithms, which sacrifice the model flexibility for 
tractability, our method approximates the posterior by particles or kernel density estimator. The flexibility 
of nonparametric model enables PMD to capture multi-modal in posterior. 
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Stochasticity. At iteration t, PMD only visits a mini-batch of data to compute the stochastic functional 
gradient, and samples 0(t) points from the solution. Hence, it avoids scanning over the whole dataset in 
each update. 

Theoretical guarantees. We show the density estimator provided by PMD converges in terms of both 
integral approximation and I\ L-divergence to the true posterior density in rate 0(l/y/m) with m particles. 
To our best knowledge, these results are the first of the kind in Bayesian inference for estimating posterior. 

In the remainder, we will introduce the optimization view of Bayes’ rule before presenting our algorithm, 
and then we provide both theoretical and empirical supports of PMD. 

Throughout this paper, we denote KL as the Kullback-Leibler divergence, function q(6) as q , a random 
sequence as 9y t 1 := [0 1; ...,0 t ], integral /(•) w.r.t. some measure p(9) over support D as f f(9)p(d9), or 
f f{9)dd without ambiguity, (■, -)l 2 as the L 2 inner product, and || • || p as the L p norm for 1 ^ p ^ 00 . 


2 Optimization View of Bayesian Inference 


Our algorithm stems from the connection between Bayes’ rule and optimization. 
[44], Zhu et al. [45] showed that Bayes’ rule 


p(9\X) 


P(°)Yln=lP( X n\0) 

P{X) 


Williams [|3J , Zellner 


where p(X) = Jp(9) Y\ n —\P{x n \9)dB, can be obtained by solving the optimization problem 


N r 

min L(q) := — > 
q{0)er ^ 

n =1 


q(6) logp(x n \6) dO 


KL(q(9) || p(9)), 


(1) 


where V is the valid density space. The objective, L(q), is continuously differentiable with respect to q £ V 
and one can further show that 

Lemma 1 Objective function L(q) defined on q[0) £ V is 1-strongly convex w.r.t. KL-divergence. 

Despite of the closed-form representation of the optimal solution, it can be challenging to compactly represent, 
tractably compute, or efficiently sample from the solution. The normalization, p(X) = f p(6) P( x n \9)d9, 
involves high dimensional integral and typically does not admit tractable closed-form computation. Mean¬ 
while, the product in the numerator could be arbitrarily complicated, making it difficult to represent and 
sample from. However, this optimization perspective provides us a way to tackle these challenges by lever¬ 
aging recent advances from optimization algorithms. 


2.1 Stochastic Mirror Descent in Density Space 

We will resort to stochastic optimization to avoid scanning the entire dataset for each gradient evaluation. 
The stochastic mirror descent |.‘12l expands the usual stochastic gradient descent scheme to problems with 
non-Euclidean geometries, by applying unbiased stochastic subgradients and Bregman distances as prox-map 
functions. We now explain in details, the stochastic mirror descent algorithm in the context of Bayesian 
inference. 

At t-th iteration, given a data point Xt drawn randomly from the dataset, the stochastic functional 
gradient of L(q) with respect to q{9) £ L 2 is gt(9) = log (q(6)) — log (p(9)) — N log p(xt\9). The stochastic 
mirror descent iterates over the prox-mapping step qt+ 1 = P 9t (7t<7t), where jt > 0 is the stepsize and 

p q(9) ■= argmin msV {(q,g) L2 + KL(q\\q)}. 

Since the domain is density space, A'L-divergence is a natural choice for the prox-function. The prox-mapping 
therefore admits the closed-form 

q t +i{9) = q t {9) exp(—7 t <7 t (0))/Z (2) 

= qtid^piey'pixtld^/Z, 
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where Z := f qt{0) exp(— lt9t{9)) dO is the normalization. This update is similar the Bayes’ rule. However, 
an important difference here is that the posterior is updated using the fractional power of the previous 
solution, the prior and the likelihood. Still computing q t +i(0) can be intractable due to the normalization 

2.2 Error Tolerant Stochastic Mirror Descent 

To handle the intractable integral normalization at each prox-mapping step, we will consider a modified 
version of the stochastic mirror descent algorithm which can tolerate additional error in the prox-mapping 
step. Given e ^ 0 and g £ L 2 , we define the e-prox-mapping of q as the set 

p ^(ff) : = {«£ 'P : KL(q\\q) +(g,q) Lz ^ mmq eV {KL(q\\q) + (g,q) L2 } + e}, (3) 

and consider the update q t +i(9) £ P|*(7 t9t)- When e t = 0,Vt, this reduces to the usual stochastic mirror 
descent algorithm. The classical results regarding the convergence rate can also be extended as below 
Theorem 2 Let q* = argmin ?g p L{q), stochastic mirror descent with inexact prox-mapping after T steps 
gives the recurrence: 

E[KL(q*\\q t+1 )\ < e t + (1 - lt )E[KL(q*\\q t )] + ^HWlo 
Remark 1: As shown in the classical analysis of stochastic mirror descent, we could also provide a non- 
asymptotic convergence results in terms of objective error at average solutions, e.g., simple average c[t = 
E7=i 7t«t/ ELi 7t in Appendix [b] 

Remark 2: For simplicity, we present the algorithm with stochastic gradient estimated by a single data 
point. The mini-batch trick is also applicable to reduce the variance of stochastic gradient, and convergence 
remains the same order but with an improved constant. 

Allowing error in each step gives us room to design more flexible algorithms. Essentially, this implies that 
we can approximate the intermediate density by some tractable representation. As long as the approximation 
error is not too large, the algorithm will still converge; and if the approximation does not involve costly 
computation, the overall algorithm will still be efficient. 


3 Particle Mirror Descent Algorithm 

We introduce two efficient strategies to approximate prox-mappings, one based on weighted particles and 
the other based on weighted kernel density estimator. The first strategy is designed for the situation when 
the prior is a “good” guess of the true posterior, while the second strategy works for general situations. 
Interestingly, these two methods resemble particle reweighting and rejuvenation respectively in sequential 
Monte Carlo yet with notable differences. 


3.1 Posterior Approximation Using Weighted Particle 

We first consider the situation when we are given a “good” prior, such that p(0) has the same support as 
the true posterior q*(9), i.e., 0 ^ q*(8)/p(9) < C. We will simply maintain a set of samples (or particles) 
from p(9), and utlize them to estimate the intermediate prox-mappings. Let {9i}™_ 1 ~ p(9) be a set of fixed 
Ltd. samples. We approximate qt+\(8) as a set of weighted particles 


9t+iW = E^i«i +1 W. 


t+ 1 ._ a* cxp(-7tgt(gj)) 

* ' Y.T=l a i ex P(-7tSt(6>i)) 


yt > 1 . 


(4) 


The update is derived from the closed-form solution to the exact prox-mapping step <§■ Since the normal¬ 
ization is a constant common to all components, one can simply update the set of working variable as 
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(5) 


oti-^a] 7t p(a; t |0 i ) JV7t ,Vi 

Oil 

^ \—\m • 

Z^i=l a i 

We show that the one step approximation 0 incurs a dimension -independent error when estimating the 
integration of a function. 

Theorem 3 For any bounded and integrable function f, E [|J q t (9)f(9)d9 — f q t (9)f(9)d9\] ^ 2C ^.°° . 

Remark. Please refer to the Appendix 0 for details. When the model has several latent variables 9 = 
(£,£) and some parts of the variables have closed-form update in ([2]). e.g., sparse GPs and LDA (refer to 
Appendix [Fj , we could incorporate such structure information into algorithm by decomposing the posterior 
q(9) = q(f)q(C |£). When p(£) satisfies the condition, we could sample {Ci}™ i ~ _p(£) and approximate the 
posterior with summation of several functions, i.e., in the form of q{9) ~ a i9(Cl£»)- 


3.2 Posterior Approximation Using Weighted Kernel Density Estimator 


In general, sampling from prior p(9) that are not so “good” will lead to particle depletion and inaccurate 
estimation of the posterior. To alleviate particle degeneracy, we propose to estimate the prox-mappings via 
weighted kernel density estimator (KDE). The weighted KDE prevents particles from dying out, in a similar 
fashion as kernel smoothing variant SMC M and one-pass SMC [3], but with guarantees. 

More specifically, we approximate qt+i(9) via a weighted kernel density estimator 


^ ■> TTL 

Qt+i{9) = ai K h (0 — 9i), 

z - J !,— l 


Oii 


exp(~7 t9t{9j)) 
££iexp(-m(0i))’ 


{Oi)Z 1 


i.i.d. 


qt(9 ), 


( 6 ) 


where h > 0 is the bandwidth parameter and Kh{9) := j^K(6/h) is a smoothing kernel. The update serves 
as an e-prox-mapping ([3]) based on the closed-form solution to the exact prox-mapping step ([2]). Unlike the 
first strategy, the particle location in this case is sampled from the previous solution qt(9). The idea here is 
that qf{9) = q t (6) exp(—'y t g t (9))/Z can be viewed as an importance weighted version of qt{9) with weights 
equal to exp (—7 t g t (9))/Z. If we want to approximate qt(9), we can sample m locations from qt{9) and 
associate each location the normalized weight ai. To obtain a density for re-sampling in the next iteration, 
we place a kernel function Kh{9) on each sampled location. Since ai is a ratio, we can avoid evaluating the 
normalization factor Z when computing ai. In summary, we can simply update the set of working variable 
ai as 

ai «- q t {6i)~ lt p{e i y it p{xt\9i) N " l \yi (7) 

Oii 

^ v-^m • 

Ui= 1 a i 


Intuitively, the sampling procedure gradually adjusts the support of the intermediate distribution towards 
that of the true posterior, which is similar to “rejuvenation” step. The reweighting procedure gradually 
adjusts the shape of the intermediate distribution on the support. Same as the mechanism in Doucet et al. 
nn. Balakrishnan and Madigan 0], the weighted KDE could avoid particle depletion. 

We demonstrate that the estimator in ([b]) in one step possesses similar estimation properties as standard 
KDE for densities (for details, refer to the Appendix [d|) . 

Theorem 4 Let qt be a (/3;£)- Holder density function, and K be a /3-valid density kernel, and the kernel 
bandwidth chosen as h = 0(m d + 2 P). Then, under some mild conditions, E j| qt(9) — qt(9)\\ l = 0(m d + 2 P). 

A kernel function AT(•) is called /3-valid, if / z s K(z)dz = 0 holds true for any s = (si, ..., So) € N d with 
|s| ^ \J3\. Notice that all spherically symmetric and product kernels satisfy the condition. For instance, the 
Gaussian kernel K(9 ) = (27r) _d / 2 exp(— ||0|| /2) satisfies the condition with /3 = 1, and it is used throughout 
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our experiments. Theorem [4] implies that the weighted KDE achieves the minmax rate for density estimation 
in (/3; £)-Holder function class [Tl], where /3 stands for the smoothness parameter and C is the corresponding 
Lipschitz constant. With further assumption on the smoothness of the density, the weighted KDE can 
achieve even better rate. For instance, if /? scales linearly with dimension, the error of weighted KDE can 
achieve a rate independent of the dimension. 

Essentially, the weighted KDE step provides an e-prox-mapping P~*( 7 t<?t) (|3j) in density space as we 
discussed in Section [2] The inexactness is therefore determined by the number of samples mt and kernel 
bandwidth ht used in the weighted KDE. 


3.3 Overall Algorithm 

We present the overall algorithm, Particle Mirror De¬ 
scent (PMD), in Algorithm [I] The algorithm is based 
on stochastic mirror descent incorporated with two 
strategies from section |3.1| and |3.2| to compute prox- 
mapping. PMD takes as input N samples A' = 
{ x ri }n=i , a P r i° r p(9) over the model parameter and 
the likelihood p(x\9), and outputs the posterior density 
estimator qr{9) after T iterations. At each iteration, 
PMD takes the stochastic functional gradient infor¬ 
mation and computes an inexact prox-mapping q t (9) 
through either weighted particles or weighted kernel 
density estimator. Note that as discussed in Section [2] 
we can also take a batch of points at each iteration 
to compute the stochastic gradient in order to reduce 
variance. 

In Section [4j we will show that, with proper setting 
of stepsize 7 , Algorithm [l] converges in rate 0(l/y/m) 
using m particles, in terms of either integral approxi¬ 
mation or ATA-divergence, to the true posterior. 


Algorithm 1 Particle Mirror Descent Algorithm 
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Input: Data set X = {x n }^ =1 , prior p(9) 
Output: posterior density estimator qr{9) 
Initialize q\{9) = p{9) 
for t = 1,2, ...,T — 1 do 

unif. ,, 

Xt ~ A 

if Good p{9) is provided then 
1 *~ d ' p{9) when t = 1 
on -<- a l 1_7t p(a; t |6»i) JV7t ,Vi 

ai E"1 m 
q t +i{9) = E 
else 

{9i)TA m 

oii <- q t (9i) " lt p{9 i y t p(x t \9 i ) N ^ t 


OLi 


:,Vi 


Qt+i(0) = E7Ji^K ht (9-9 t ) 

end if 
end for 


In practice, we could combine the proposed two algorithms to reduce the computation cost. In the 
beginning stage, we adopt the second strategy. The computation cost is affordable for small number of 
particles. After we achieve a reasonably good estimator of the posterior, we could switch to the first strategy 
using large size particles to get better rate. 

4 Theoretical Guarantees 

In this section, we show that PMD algorithm (i) given good prior p(9), achieves a dimension independent, 
sublinear rate of convergence in terms of integral approximation; and (ii) in general cases, achieves a dimen¬ 
sion dependent, sublinear rate of convergence in terms of AW-divergence with proper choices of stepsizes. 

4.1 Weak Convergence of PMD 

The weighted particles approximation, qt(9) = El=i a i $(9i)i returned by Algorithm [l] can be used directly 
for Bayesian inference. That is, given a function /, f q*{9)f{9)d9 can be approximated as El=i otif(9i). 
We will analyze its ability in approximating integral, which is commonly used in sequential Monte Carlo 
for dynamic models jjj and stochastic Langevin dynamics |36 1. For simplicity, we may write E;=i a if(9i) 
as f q t (9) f (9)d9 , despite of the fact that qt{9) is not exactly a density here. We show a sublinear rate of 
convergence independent of the dimension exists. 
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Theorem 5 (Integral approximation) Assume p(9) has the same support as the true posterior q*{9), 
i.e., 0 ^ q*{9)/p{9) ^ C. Assume further model ||p(a;|0) iv '|| oo ^ p,\/x. Then\/f(9) bounded and integrable, 
the T-step PMD algorithm with stepsize "ft — f returns m weighted particles such that 


E 


q T (9)f(9)d9 


q*(9)f(9)d9 


€ 


Vmax{C, pe M )}||/|| c 


max \ VKL(q*\\p), 


pM 





where M := max t= i ip .. :T ||st||oo- 


Remark. The condition for the models, ||p(a;| 6 , ) iv || CX) < p,V x, is mild, and there are plenty of models 
satisfying such requirement. For examples, in binary/multi-class logistic regression, probit regression, as 
well as latent Dirichlet analysis, p ^ 1. Please refer to details in Appendix[C] The proof combines the results 
of the weighted particles for integration, and convergence analysis of mirror descent. One can see that the 
error consists of two terms, one from integration approximation and the other from optimization error. To 
achieve the best rate of convergence, we need to balance the two terms. That is, when the number particles, 
m, scales linearly with the number of iterations, we obtain an overall convergence rate of 0(^)=). In other 

words, if the number of particles is fixed to m, we could achieve the convergence rate 0(-)=) with T = 0(m) 
iterations. 


4.2 Strong Convergence of PMD 

In general, when the weighted kernel density approximation scheme is used, we show that PMD enjoys a 
much stronger convergence, i.e., the KL -divergence between the generated density and the true posterior 
converges sublinearly. Throughout this section, we merely assume that 

• The prior and likelihood belong to (/3; £)-Holder class. 

• Kernel K(-) is a /3-valid density kernel with a compact support and there exists pt, v, 5 > 0 such that 
/ K(z) 2 dz < /i 2 , / \\z\\P\K(z)\dz < v. 

• There exists a bounded support D such that almost surely bounded awary from A -1 > 0. 

Note that the above assumptions are more of a brief characteristics of the commonly used kernels and 
inferences problems in practice rather than an exception. The second condition clearly holds true when the 
logarithmic of the prior and likelihood belongs to C oo with bounded derivatives of all orders, as assumed in 
several literature j35j I36| . The third condition is for characterizing the estimator over its support. These 
assumptions automatically validate all the c ond itions required to apply Theorem [4] and the corresponding 
high probability bounds (stated in Corollary |l7| in appendix). Let the kernel bandwidth h t = m t 1 ^ d+2 ^^ 
we immediately have that with high probability, 


Ut+i - Pg t (7t5t)lli < 0(m t /3/(d+2,3) ). 

Directly applying Theorem [2j and solving the recursion following [32], we establish the convergence results 
in terms of KL-divergence. 

Theorem 6 (KL-divergence) Based on the above assumptions, when setting 7 1 = min{ —— //\d+ 2 / 3 ) }, 


E[KL(q*\\q T )] 

where M := max t= i v .. : T ||gt|| c 
with 0 ( 1 ) being a constant. 




2 max {Di, M 2 } ELi ^ 
+ (- 1 - 


2/3 

d+2/3 


C2TTl T 


0 

d+2/3 


rji ‘ ~ -L rji 2 

D\ = KL(q*\\qi), C\ := 0(l)(/x + vC) 2 fi 2 A, and C 2 := 0(l)M(/i + uC) 















Table 1: Summary of the related inference methods 


Methods 

Provable 

Convergence 

Convergence 

Cost 


Black 



Criterion 

Rate 

Computation 
per Iteration 

Memory 

Box 

SVI 

No 

- 

- 

fi(d) 

o(d) 

No 

NPV 

No 

- 

- 

n(dm 1 2 N + d 2 N) 

O(dm) 

No 

Static SMC 

No 

- 

- 

fl(dm) 

0 (dm) 

Yes 

SGLD 

Yes 

l(9-9*,/)| 


n(d) 

O(dm) 

Yes 

PMD 

Yes 

1(9-9*,/)] 

*£(9*||9) 

0(m“5) 

0 {m~ s) 

£l(dm) 

Q(dm 2 ) 

0 (dm) 

0 (dm) 

Yes 


Remark. Unlike Theorem[5] the convergence results are established in terms of the AW-divergence, which 
is a stronger criterion and can be used to derive the convergence under other divergences |16| . To our best 
knowledge, these results are the first of its kind for estimating posterior densities in literature. One can 
immediately see that the final accuracy is essentially determined by two sources of errors, one from noise in 
applying stochastic gradient, the other from applying weighted kernel density estimator. For the last iterate, 
an overall O(^) convergence rate can be achieved when mt = 0(t 2+d ^). There is an explicit trade-off 
between the overall rate and the total number of particles: the more particles we use at each iteration, the 
faster algorithm converges. One should also note that in our analysis, we explicitly characterize the effect 
of the smoothness of model controlled by /3, which is assumed to be infinite in existing analysis of SGLD. 
When the smoothness parameter (3 » d, the number of particles is no longer depend on the dimension. 
That means, with memory budget O(dm), z.e., the number of particles is set to be O(m), we could achieve 
a 0 (1/y/m) rate. 

Open question. It is worth mentioning that in the above result, the 0(1/T) bound corresponding to the 

_ ,, 3 

stochasticity is tight (see Nemirovski et al. [32]), an d the 0(m d + 2 P ) bound for KDE estimation is also tight 
by itself (see [3]). An interesting question here is whether the overall complexity provided here is indeed 
optimal? This is out of the scope of this paper, and we will leave it as an open question. 


5 Related Work 

PMD connects stochastic optimization, Monte Carlo approximation and functional analysis to Bayesian 
inference. Therefore, it is closely related to two different paradigms of inference algorithms derived based on 
either optimization or Monte Carlo approximation. 

Relation to SVI. From the optimization point of view, the proposed algorithm shares some similarities to 
stochastic variational inference (SVI) |T9]-both algorithms utilize stochastic gradients to update the solution. 
However, SVI optimizes a surrogate of the objective , the evidence lower bound (ELBO), with respect to a 
restricted parametric distributiorQ while the PMD directly optimizes the objective over all valid densities in 
a nonparametric form. Our flexibility in density space eliminates the bias and leads to favorable convergence 
results. 

Relation to SMC. From the sampling point of view, PMD and the particle filtering/sequential Monte 
Carlo (SMC) [H] both rely on importance sampling. In the framework of SMC sampler [TO], the static SMC 
variants proposed in [H El bares some resemblances to the proposed PMD. However, their updates come 
from completely different origins: the static SMC update is based on Monte Carlo approximation of Bayes’ 
rule, while the PMD update based on inexact prox-mappings. On the algorithmic side, (i) the static SMC 
re-weights the particles with likelihood while the PMD re-weights based on functional gradient, which can 

1 Even in E], “nonparametric variational inference” (NPV) uses the mixture of Gaussians as variational family which is still 

parametric. 
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be fractional power of the likelihood; and (ii) the static SMC only utilizes each datum once while the PMD 
allows multiple pass of the datasets. Most importantly, on the theoretical side, PMD is guaranteed with 
convergence in terms of both /vL-divergence and integral approximation for static model , while SMC is only 
rigoriously justified for dynamic models. It is unclear whether the convergence still holds for these extensions 
in [SIE]. 

Summary of the comparison. We summarize the comparison between PMD and static SMC, SGLD, 
SVI and NPV in Table [l] For the connections to other inference algorithms, including Annealed IS [30l . 
general SMC sampler [lOj , stochastic gradient dynamics family mmmm, and nonparametric variational 
inference [551 I5T1 1571ITljl 155] , please refer to Appendix [ g] Given dataset {xi}fL 1 , the model p(x\0), 8 £ 
and prior p(9 ), whose value and gradient could be computed, we set PMD, static SMC, SGLD and NPV 
to keep m samples/components, so that they have the same memory cost and comparable convergence rate 
in terms of to. Therefore, SGLD runs 0(rn) iterations. Meanwhile, by balancing the optimization error 
and approximation in PMD, we have PMD running 0(m ) for integal approximation and for KL- 

divergence. For static SMC, the number of iteration is O(N). From Table [I] we can see that there exists 
a delicate trade-off between computation, memory cost and convergence rate for the approximate inference 
methods. 

1. The static SMC uses simple normal distribution [5] or kernel density estimation [5] for rejuvenation. 
However, such moving kernel is purely heuristic and it is unclear whether the convergence rate of SMC 
for dynamic system [9] [T7j still holds for static models. To ensure the convergence of static SMC, 
MCMC is needed in the rejuvenation step. The MCMC step requires to browse all the previously 
visited data, leading to extra computation cost fl(dmt) and memory cost O(dt), and hence violating 
the memory budget requirement. We emphasize that even using MCMC in static SMC for rejuvenation, 
the conditions required for static SMC is more restricted. We discuss the conditions for convergence 
of SMC and PMD using particles approximation in Appendix [C[ 

2. Comparing with SGLD, the cost of PMD at each iteration is higher. However, PMD converges in rate 
of 0(m~ 2 ), faster than SGLD, 0 (to _ 3), in terms of integral approximation and ILL-divergence which 
is more stringent if all the orders of derivatives of stochastic gradient is bounded. Moreover, even for 
the integral approximation, SGLD converges only when / having weak Taylor series expansion, while 
for PAID, / is only required to be bounded. The SGLD also requires the stochastic gradient satisfying 
several extra conditions to form a Lyapunov system, while such conditions are not needed in PAID. 

6 Experiments 

We conduct experiments on mixture models, logistic regression, sparse Gaussian processes and latent Dirich- 
let allocation to demonstrate the advantages of PMD in capturing multiple modes, dealing with non-conjugate 
models and incorporating special structures, respectively. 

Competing algorithms. For the mixture model and logistic regression, we compare our algorithm with 
five general approximate Bayesian inference methods, including three sampling algorithms, be., one-pass se¬ 
quential Monte Carlo (one-pass SMC) [5] which is an improved version of the SMC for Bayesian inference [5], 
stochastic gradient Langevin dynamics (SGD Langevin) [42j and Gibbs sampling, and two variational in¬ 
ference methods, be., stochastic variational inference (SVI) |19j and stochastic variant of nonparametric 
variational inference (SGD NPV) |T5]. For sparse GP and LDA, we compare with the existing large-scale 
inference algorithms designed specifically for the models. 

Evaluation criterion. For the synthetic data generated by mixture model, we could calculate the true 
posterior, Therefore, we evaluate the performance directly through total variation and I\ L-divergence (cross 
entropy). For the experiments on logistic regression, sparse GP and LDA on real-world datasets, we use 
indirect criteria which are widely used 0 m m m m because of the intractability of the posterior. We 
keep the same memory budget for Monte Carlo based algorithms if their computational cost is acceptable. To 
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(1) True Posterior (2) Estimated Posterior (3) Total Variation (4) Cross Entropy 

Figure 1: Experimental results for mixture model on synthetic dataset. 


demonstrate the efficiency of each algorithm in utilizing data, we use the number of data visited cumulatively 
as x-axis. 

For the details of the model specification, experimental setups, additional results and algorithm deriva¬ 
tions for sparse GP and LDA, please refer to the Appendix |H| 

Mixture Models. We conduct comparison on a simple yet interesting mixture model [32]: the observations 
Xi ~ pAf(6i,al) + (1 - p)Af(0 1 + 9 2l al) and 9 X ~ Af(0,af), 9 2 ~ A/”(0, erf), where {a 1: a 2 ) = (1,1), cr x = 2.5 
and p = 0.5. The means of two Gaussians are tied together making 9\ and 9 2 correlated in the posterior. 
We generate 1000 data from the model with (0i:$2) = (1,-2). This is one mode of the posterior, there is 
another equivalent mode at (9\,9 2 ) = (—1, 2). We initialize all algorithms with prior on (6*i, 9 2 ). We repeat 



number of visited samples number of visited samples number of visited samples < i o 5 

(1) Logistic regression on MNIST (2) Sparse GP on music data (3) LDA on wikipedia data 

Figure 2: Experimental results on several different models for real-world datasets. 


the experiments 10 times and report the average results. We keep the same memory for all except SVI. The 
true posterior and the one generated by our method is illustrated in Figure [l] (1) (2). PMD fits both modes 
well and recovers nicely the posterior while other algorithms either miss a mode or fail to fit the multimodal 
density. For the competitors’ results, please refer to Appendix [H] PMD achieves the best performance in 
terms of total variation and cross entropy as shown in Figure [I] (3) (4). This experiment clearly indicates our 
algorithm is able to take advantages of nonparametric model to capture multiple modes. 

Bayesian Logistic Regression. We test our algorithm on logistic regression with non-conjugate prior 
for handwritten digits classification on the MNIST8M 8 vs. 6 dataset. The dataset contains about 1.6M 
training samples and 1932 testing samples. We initialize all algorithms with same prior and terminate the 
stochastic algorithms after 5 passes through the dataset. We keep 1000 samples for Monte Carlo based 
algorithms, except Gibbs sampling whose computation cost is unaffordable. We repeat the experiments 10 
times and the results are reported in Figure [2^1) . Obviously, Gibbs sampling [20]: which needs to scan the 
whole dataset, is not suitable for large-scale problem. In this experiment, SVI performs best at first, which 
is expectable because learning in the Gaussian family is simpler comparing to nonparametric density family. 
Our algorithm achieves comparable performance in nonparametric form after fed with enough data, 98.8%, 
to SVI which relies on carefully designed lower bound of the log-likelihood |22j . SGD NPV is flexible with 


11 














































mixture models family, however, its speed becomes the bottleneck. For SGD NPV, the speed is dragged 
down for the use of L-BFGS to optimize the second-order approximation of ELBO. 

Sparse Gaussian Processes. We use sparse GPs models to predict the year of songs [5]. In this task, we 
compare to the SVI for sparse GPs Hnil32]and one-pass SMC. We also included subset of data approxima¬ 
tion (SoD) |25] as baseline. The data contains about 0.5M songs, each represented by 90-dimension features. 
We terminate the stochastic algorithms after 2 passes of dataset. We use 16 particles in both SMC and 
PMD. The number of inducing inputs in sparse GP is set to be 2 ln , and other hyperparameters of sparse 
GP are fixed for all methods. We run experiments 10 times and results are reported in Figure. [2^2) . Our 
algorithm achieves the best RMSE 0.027, significantly better than one-pass SMC and SVI. 

Latent Dirichlet Allocation. We compare to SVI HSI. stochastic gradient Riemannian Langevin dy¬ 
namic (SGRLD) [21], and SMC specially designed for LDA [5 on Wikipedia dataset fMj. The dataset 
contains 0.15M documents, about 2M words and 8000 vocabulary. Since we evaluate their performances in 
terms of perplexity, which is integral over posterior, we do not need to recover the posterior, and therefore, 
we follow the same setting in DUZ], where one particle is used in SMC and PMD to save the cost. We 
set topic number to 100 and fix other hyperparameters to be fair to all algorithms. We stop the stochastic 
algorithms after 5 passes of dataset. The results are reported in Figure [2^3). The top words from several 
topics found by our algorithm are illustrated in Appendix [H] Our algorithm achieves the best perplexity, 
significantly better than SGRLD and SVI. In this experiment, SMC performs well at the beginning since 
it treats each documents equally and updates with full likelihood. However, SMC only uses each datum 
once, while the stochastic algorithms, e.g., SGRLD, SVI and our PMD, could further refine the solution by 
running the dataset multiple times. 


7 Conclusion 

Our work contributes towards achieving better trade-off between efficiency, flexibility and provability in 
approximate Bayesian inference from optimization perspective. The proposed algorithm, Particle Mirror 
Descent, successfully combines stochastic mirror descent and nonparametric density approximation. Theo¬ 
retically, the algorithm enjoys a rate 0(l/y/rn) in terms of both integral approximation and A'L-divergence, 
with 0(m ) particles. Practically, the algorithm achieves competitive performance to existing state-of-the-art 
inference algorithms in mixture models, logistic regression, sparse Gaussian processes and latent Dirichlet 
analysis on several large-scale datasets. 
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Appendix 

A Strong convexity 

As we discussed, the posterior from Bayes’s rule could be viewed as the optimal of an optimization problem 
in Eq 0 We will show that the objective function is strongly convex w.r.t ATA-divergence. 

Proof for Lemma[ 7J The lemma directly results from the generalized Pythagaras theorem for Bregman 
divergence. Particularly, for A'A-divergence, we have 

KL(qi\\q) = KL{q x \\q 2 ) + KL(q 2 \\q) - (cq - q 2 , V</)(?) - V<j% 2 )) 2 

where f>(q) is the entropy of q. 

Notice that L(q) = KL(q\\q*) — logZ, where q* = ^ Z = Jp(9)H^p(xi\9), we have 

KL( qi \\q*) - KL(q 2 \\q*) - ( Ql - q 2l S7d?{q 2 ) - V0(</*)> 2 = KL{q x \\q 2 ) 

=> KL^Wq*) - KL(q 2 \\q*) - (q ± - q 2 ,\ogq 2 - log q*) 2 = AW(g 1 ||g 2 ) 

=> KL{ qi \\q*) - KL(q 2 \\q*) - ( qi - q 2 ,\ogq 2 - log (p(9)Ufp(xi\9))) 2 + (gi - q 2 \ogZ) 2 = KL( qi \\q 2 ) 

0 

=> L( qi ) - L(q 2 ) - ( qi - q 2 , VA(g 2 )) 2 = KL( qi \\q 2 ) 


B Finite Convergence of Stochastic Mirror Descent with Inexact 
Prox-Mapping in Density Space 

Since the prox-mapping of stochastic mirror descent is intractable when directly being applied to the opti¬ 
mization problem 0 , we propose the e-inexact prox-mapping within the stochastic mirror descent framework 
in Section [3] Instead of solving the prox-mapping exactly, we approximate the solution with e error. In this 
section, we will show as long as the approximation error is tolerate, the stochastic mirror descent algorithm 
still converges. 

Theorem[2] Denote q* = argmin qe -p L(q), the stochastic mirror descent with inexact prox-mapping after 
T steps gives 

(a) the recurrence: \/t < T, E[KL(q*\\q t+ i)} < e t + (1 - 'y t )^[KL(q*\\q t )] + 7 * E| ^“ 

(b) the sub-optimality: E[KL(q T \\q*)\ ^ E[L(q T )-L(q*)] < M ' g ^ f= ± t f + ^=^ et+Dl where q T = X)Li 7i9t/X)Li 7t 

Z^t=i It 

and Di = KL(q*\\qi) and M 2 := maxi^t^TlE||< 7 t||;C 

Remark. Based on [32j, one can immediately see that, to guarantee the usual rate of convergence, the error 
ti can be of order 0( 7 I). The first recurrence implies an overall 0(1/T) rate of convergence for the KL- 
divergence when the stepsize is as small as 0(1/7) and error e t is as small as 0(l/t 2 ). The second result 
implies an overall 0(l/y/T) rate of convergence for objective function when larger stepsize 7 1 = 0(l/y/T) 
and larger error et = 0 (l/t) are adopted. 

Proof for Theorem [1| (a) By first-order optimality condition, q t + 1 € P^iltgt) is equivalent as 

( 7 t9t + log(g t+ i) - log(q't), q t+ 1 - g)i 2 < e t ,\/q £ V, 


which implies that 

{ltgtAt +1 - q )2 < (log(gt) - \og(qt+i),q t +i - 9)2 + 64 = KL(q\\q t ) - KL(q\\q t+1 ) - KL(q t+1 \\q t ) + e t 
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Hence, 


(7 t9t,qt ~ g >2 < KL(q\\q t ) - KL(q\\q t+1 ) - KL(q t+1 \\q t ) + (7 t9t,qt - 9t+ 1)2 + 6 *. 
By Young’s inequality, we have 

1 7 2 

{7 t9t,9t - Qt+ 1)2 < - <?t+i||i + -y IlStll^o- 

Also, from Pinsker’s inequality, we have 


KL(q t+1 \\q t ) ^ -||g t — <7t+i||i- 


Therefore, combining ©, & and (flO]) , we have \/q £ V 


<7 t9tAt - 9)2 < e t + /\L((7||g t ) - A7L(g||g t+1 ) + y||g t || 


Plugging g* and taking expectation on both sides, the LHS becomes 


E, 


(qt - q*,itgt) 


= E, 


(qt - q*, 7t%«]) 




= E, 


Therefore, we have 


Ej. 


(qt - Q*, 7 tVL(g t )} 


(<7t - q*,ltVL(q t )) 


^e t +E x [KL(q*\\q t )]-E x [KL(q*\\q t+1 )]+^E x \\g t \ 


Because the objective function is 1-strongly convex w.r.t. ATA-divergence, 

(q' ~ q , VL(g') - VL(g)> = ^T(g'lk) + KL(q\\q'), 
and the optimality condition, we have 

(q t -q*,VL(q t )} > KL{q*\\q t ) 
we obtain the recursion with inexact prox-mapping, 

E x [KL(q*\\q t+1 )} < e* + (1 - 7t )E I [ifL( g *||g t )] + ^M 2 

(b) Summing over t = 1,..., T of equation we get 

T t t 2 

E E *[<9t - 9*,7tVL(g t )>] ^ + KL (?\\h) + E T M2 


t= 1 t= 1 

By convexity and optimality condition, this leads to 


£=1 


5^7* jE x [L(gr)-L(g*)] <E a 


s,£=l 


J2lt{L(q t ) - L{q*)) 


t—1 


^J2^t + KL( q *\\q 1 ) + Y / : ^M 2 


t—1 


t—1 


Furthermore, combined with the 1-strongly-convexity, it immediately follows that 
E x [KL(q T \\q*)\ < E x [L(g r )-L(g*)]^ 


^l^+El^t + D, 


Et =1 7t 


( 8 ) 


(9) 


( 10 ) 


( 11 ) 
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C Convergence Analysis for Integral Approximation 

In this section, we provide the details of the convergence analysis of the proposed algorithm in terms of 
integral approximation w.r.t. the true posterior using a good initialization. 

Assume that the prior p(9) has support Q cover true posterior distribution q* (9), then, we could represent 

q* (0)eJ= jg(0) = a(9)p(9), j a(9)p(9)d9 = 1,0 < a{9) < C 


Therefore, one can show 

Lemma 7 Vg £ T, let {9 i }^] =1 is i.i.d. sampled from p(9), we could construct q(9) = YYiLi yy™ a(e'j > su °h 
that V/(0) : —> R. bounded and integrable, 


E 


mmd9 


q(9)f(9)d9 


€ 


2 Vc\\f\u 


m 


Proof 

Given q(6), we sample i.i.d.{9i}^ 1 from p(9), and construct a function 

1 m 

m = -E“(wm- 

m z ' 

i= 1 


It is obviously that 


Eg[<?(0)] 


Eg 


m 

-y j a{9 i )5{9 i ,9) 


i =1 


1 

m 


E E * 

2=1 


a{9i)S(0i,0) 


q(0) 


and 


Eg 


q(9)f(0)d9 


Eg 


1 III 

-£a«w«y 


m 


E E « 


a(9i)f(0i) 


= J q(9)f(9)d9 


Then, 


Eg 


mmdO- / q(O)H9)d0 


= Eg 


m 


q(9)f(O)d0- Eg 


1 


mmd9 


= - E fl || a(9i)m)\\2 - l|E e[a(0i) m)]\\i < ~M a{9 i )f{9 i )\\i = ~ / <*Wf(9)M9)d9 


m 


TO 


1 

TO 


c 


c , 


a(9)f(9yq(0)d9< -\\mWio a(9)q(9)d9< -||/(0)||^||a 


By Jensen’s inequality, we have 


E e 


q(9)f(0)d0 - / q{9)f(9)d9 


< t/Eg 


q(0)md0- / <?(0)/(0)d0 


2l 




Vc\\f( 0 )\\ c 


Apply the above conclusion to f{9) = 1, we have 



TO 
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Let 9( 0 ) = EI Er^ ) (g ( .) , ’ ) ’ then Sr = !> and 


Eg 


= Efi 


= E fi 


mime - / mime 


= Eg 


m 1 m 

YPaiPi) ^ “VMM “WfW 


1 - 


1 - 


ET^i) 


m 

E?<*{Qi) 

m 


E?*(e t ) 


X>(fc)M) 


1 


ET^i) 

Then, we have achieve our conclusion that 


zxwwi 


^ E 


VT 


Hl/Wllc 




VC||/(6>)||c 


m 


Eg 


< Eg 


mmdB- / q(e)f(9)de 


mmdff- / 9 (0)/(0)^ 


Eg 


q(0)f(O)d0- / mmdS 


< 


zvcwfmu 


With the knowledge of p(8) and q(8), we set qt{0) = a t (9)p(9), the PMD algorithm will reduce to adjust 
a(6i) for samples {0j}£L 1 ~ n(9) according to the stochastic gradient. Plug the gradient formula into the 
exact update rule, we have 


gt(0)exp(-7 t 3 t (0)) 

qt+i(V) =-^- 


a t ( 0 )exp(- 7 tfft ( 0 ))p( 0 ) 

---= a t+ i(0)p(0) 


where a t +i{9) = a d e ) ox p O-itgt(S)) . gj nce % j g constant, ignoring it will not effect the multiplicative update. 

Given the fact that the objective function, L(q), is 1 -strongly convex w.r.t. the KL- divergence, we can 
immediately arrive at the following convergence results as appeared in Nemirovski et al. |32| . if we are able 
to compute the prox-mapping in Eq.Q exactly. 

Lemma 8 One prox-mapping step Eq. Q) reduces the error by 

E[KL(q*\\q t+1 )} < (1 - lt )E[KL(q*\\q t )} + 


With st.epsize = %> it implies 

E[KL(q*\\q T )] < max |jCL(g*|| gl ), 

Proof We could obtain the recursion directly from Theorem [2] by setting e = 0, which means solving the 
prox-mapping exactly, and the rate of convergence rate could be obtained by solving the recursion as stated 

in [32]. ■ 


Lemma 9 Let q t is the exact solution of the prox-mapping at t-step, then V f{9) : —> R, which is bounded 

and integrable, we have 


E 


q t {d)f{6)d6- / q(0)f(6)de 


^ max \ y/KL(q*\\ qi ), 


Vt 
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Proof 


E 


q t (9)f(8)d9- / q*(9)f(9)d9 


= E\\(q t (6)-q*(d)J(6)) L2 \\ 2 


<n\\qt(o)- q *mu 

^ max \y/KL(q*\\ qi ), 




3 E[||<? t (0)-g*(0)|| 1 ] < ll/HooE 


-KL{q*\\q t ) 


? 7 E ll3lloo 1 ll/ll 


V^F 7 ^ J Vi 

The second last inequality comes from Pinsker’s inequality. 


Theorem [5] A ssume the particle proposal prior p{9) has the same support as the true posterior q*{9), 
i.e., 0 ^ q*(9)/p(9) ^ C. With further condition about the model ||p(x|0) iv || oo ^ p, Vx, then V/(0) : —> K. 

bounded and integrable, with stepsize ^, the PMD algorithm return m weighted particles after T iteration 
such that 


E 




q t (9)f(9)d9- J q*(9)f(9)d9 

2y ma x{C. < ,expfl| 9 (D)|U)}||/|U +mm f V A . L(g , ||j) | 


y/2r] — 1 j VT ' 


Proof for Theorem [5| 

We first decompose the error into optimization error and finite approximation error. 


E 


< E 


q t (9)f(9)d9- / q*(9)f(9)d0 


q t (0)f(6)d0- / q t (9)f(9)d0 


-E 


q t (9)f(8)d9- / q*(O)f(O)d0 


finite approximation error e\ 

For the optimization error, by lemma [9j we have 

e -2 f max ^VKL(q*\\ qi ), 

Recall that 

m\ 9i-i(0)exp(-7t_i 5t _i(0)) 
Qt{ 0 ) = -v- 


optimization error €2 


Vt 


a t -i{e)TT(e){a t Jl 1 (9)p(x\9) N ' yt ~ 1 ) n x-n,- lta ^, a ^v{x\ e ) Nlt ~ 1 

— a t-1 \d) 7T \9) ry 


which results the update a t {9) = 


Z 

V t Zf i -- 1 {e) P (.x |0) N -n-i 


Notice Z = J q t (9) exp(— r ytg t (6))d9, we have 


exp(—7t||gt(0)|| oo ) < Z < exp(7 t ||g t (0)|| oo ). By induction, it can be show that ||a t ||oo < max{C',pexp(||5 t (0)|| oo )} < 
max{C,pexp(||g(0)|| oo )}. Therefore, by lemma[7| we have 


e l ^ 


2 v /max{C,pexp(|| ff (0)|| oo )}||/|| c 


Combine e\ and e 2 , we achieve the conclusion. ■ 

Remark: Simply induction without the assumption from the update of at. (9) will result the upper 
bound of sequence ||at||oo growing. The growth of sequence IjatHoo is also observed in the proof J] for 
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sequential Monte Carlo on dynamic models. To achieve the uniform convergence rate for SMC of inference 
on dynamic system, Crisan and Doucet Jjj, Gland and Oudjane m require the models should satisfy i), 
ev(9i) < p(xi\6i)p{6i\Qi-i) < e~ 1 iy(9 i ), Vx where v{9) is a positive measure, and ii), |gjp(a;|-)) ^ p ' 

Such rate is only for SMC on dynamic system. For static model, the trandistiion distribution is unknown, 
and therefore, no guarantee is provided yet. With much simpler and more generalized condition on the 
model, i.e., ||p(a:| 0 ) JV ’|| oo ^ p, we also achieve the uniform convergence rate for static model. There are 
plenties of models satisfying such condition. We list several such models below. 

1 . logistic regression, p(y\x,w) = 1 +exp( l yw T x) , and \\p{y\x, u>)||oo < 1 . 

2 . probit regression, p(y = l\x, w ) = <h(w; T :r) where $(•) is the cumulative distribution function of normal 
distribution. ||p(y|x,w )|| 00 ^ 1 . 

3. multi-category logistic regression, p(y = k\x,W) = , and \\p(y\x, W)^ ^ 1. 

4. latent Dirichlet allocation, 

p(x d \9 d ,<i>) = E Zd ^ p ( Zil g d) \p(x d \z d ,$)] 

N d W K 

P (x d \z dl <s>) = nnn $ “ dri ” 

n—lw=1k—1 
N d K 

P (z d \e d ) = nnr 

n—lk—1 

and ||p(x d | 0 d ,<f>)|| oo < max^ \\p(x d \z d , $)||oo < 1 - 

5. linear regression, p(y\w,x) = exp(—(y - w T x) 2 /2a 2 ), and \\p(y\w, a;)||oo < 77 ^. 

6. Gaussian model and PCA, p(x\p, E) = (27rdet(S)) _ 3 exp \{x — /r) T S(x — p) 

(27rdet(£))“3. 

D Error Bound of Weighted Kernel Density Estimator 

Before we start to prove the finite convergence in general case, we need to characterize the error induced by 
weighted kernel density estimator. In this section, we analyze the error in terms of both L\ and L 2 norm, 
which are used for convergence analysis measured by ATL-divergence in Appendix |E|. 

D.l Li-Error Bound of Weighted Kernel Density Estimator 

We approximate the density function q{9) using the weighted kernel density estimator q(0) and would like 
to bound the L\ error, i.e. ||Q'(^) — 9(0)||i both in expectation and with high probability. We consider an 
unnormalized kernel density estimator as the intermediate quantity 

.. m 

Qm{9) = — o(0i)K h (9,0i) 

z=i 

Note that E[£ m (0)] = E#. [cu(0i)Kh(0, 6i)\ = q* K Then the error can be decomposed into three terms as 
e := E ||9(0) - 9(0)11, < E ||g(0) - ^(0)11, +E \\ em {9) - E ^(0)11, + ||E g m (9) - 9(0)11, 

S -V-' "-V-•' s -V-' 

normalization error sampling error (variance) approximation error (bias) 


and \\p(x\p, 51)1100 < 
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We now present the proof for each of these error bounds. 

To formally show that, we begin by giving the definition of a special class of kernels and Holder classes 
of densities that we consider. 

Definition 10 {{ft; /x, v, <5)-valid density kernel) We say a kernel function K(-) is a (/3; /x, v)-valid density 
kernel, if I\ (6, 9) = K{6 — 9) is a bounded, compactly supported kernel such that 

(i) f K(z)dz = 1 

(ii) f \K{z)\ r dz ^ oo for any r ^ 1, particularly , f K{z) 2 dz ^ /x 2 for some /x > 0. 

(in) J z s K(z)dz = 0, for any s = (si,..., Sd) £ such that 1 ^ |s| ^ [_/3J. In addition, f ||;z||^|.K’ (z)\dz ^ 
v for some v > 0. 

For simplicity, we sometimes call K(-) as a /3-valid density kernel if the constants /x and v are not specifically 
given. Notice that all spherically symmetric compactly supported probability density and product kernels 
based on compactly supported symmetric univariate densities satisfy the conditions. For instance, the 
kernel K{9) = (27r) —d / 2 exp(— ||#|| 2 /2) satisfies the conditions with ft = oo, and it is used through out our 
experiments. Furthermore, we will focus on a class of smooth densities 

Definition 11 {{ft; £)-Holder density function) We say a density function q(-) is a {ft\ C)-Holder den¬ 
sity function if function q(-) is {ft\-times continuously differentiable on its support Q and satisfies 
(i) for any zq, there exists L(zq ) > 0 such that 

\q(z) - qio\z )| < L{z 0 )\\z - zqW 0 y z e n 

where qfj is the \_ft\-order Taylor approximation, i.e. 

<${*) ■■= E ^—^-D s q{z 0 ); 
s=(si,...,s c j):|s|<L/ 8 J 


(ii) in addition, the integral f L{z)dz ^ C. 
f £ means f is {ft; C)-Holder density function. 

Then given the above setting for the kernel function and the smooth densities, we can characterize the 
error of the weighted kernel density estimator as follows. 


D.1.1 KDE error due to bias 

Lemma 12 (Bias) If q() £ 6^(0) and K is a {ft; /x, v)-valid density kernel, then 

\\q{9)-nQm{9)}\\^vChP. 


Proof The proof of this lemma follows directly from Chapter 4.3 in m- 


|E[(?m(0)]- </(0)| = \q*K h {9)-q{0) | 


< 


/ b‘ K ^h)^ 9+z '^ ~ q ^ dz 

J K{z)[q{9 + hz) — q{8)]dz 
J K{z)\q{9 + hz) — q^ {9 + hz)\dz 


< L{9) / \K{z)\\hzfdz- 


K{z)\qf\8 + hz) — q{9)\dz 


K{z)[q^\9 + hz) — q{9)\dz 
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Note that q'J S \d + hz) — q(9) is a polynomial of degree at most [/3J with no constant, by the definition of 
(/3; n, y)-valid density kernel, the second term is zero. Hence, we have |E[g m (0)] — q(6 )| < uL(9)hP, and 
therefore 


11%, 


,(0)] - q(0 )||i < vh? J L(9)d9 sj vLh.P. 


D.1.2 KDE error due to variance 

The variance term can be bounded using similar techniques as in Pi- 
Lemma 13 (Variance) Assume co^/p € L\ with bounded support, then 

E||£>m(0) -E[e m (0)]|| 1 < d [LOy/pdO + o((mh d )~^). 

yjmhi J 

Proof For any 0, we have 

<^(0) :=E[(e m (0)-E[&„(0)]) 2 ] 

= -if]£[^(00*2(MO] - (q*K h ) 2 < 

m m 

2 = 1 

Denote p.{K) := J K{9) 2 d9 and kernel K + {6) = , then n(K) < p, J K + d9 = 1 and 


KW = Vd K + (0/d) = 


1 K{9/h)K(9/h) h d 


h d p 2 {I<) 


p?{K) 




Hence, 


p 2 (K){u] 2 p)*K+ u 2 [(uj 2 p)*K+-u; 2 p\ fr 2 (u 2 p) 


< t 2 ( 9 ) < ' v ^ // -^ sJ 


mh d mh d mh d 

Note that <r(0) = y/E [(g m (0) - E[g m (9)}) 2 } ^ E|g m (0) - E[g m (0)]|, hence 

E||e m (0)-E[ 0m (0)]|| 1 

= [ E| em (0)-E[p m (0)]|d0< I a(9)d9 


€ 


< 


< 


1 pi 2 (uj 2 p) * — u 2 p] / pi 2 (co 2 p) 


mh d 


mh d 


d9 


^[mh d ! 2 


J \Joj 2 p d9 + J \J (co 2 p) * — uj 2 p d9 


J uiytp d9 + yfllj ' J J | ( w2 p) * K~h ~ w 2 p\ d9 


y/mh d / 2 

From Theorem 2.1 in [12], we have J |(w 2 p) * — u> 2 p | d9 = o(l). Therefore, we conclude that 


E ||f?m(0) — E[g m (0)]|| 1 < 


^ [mh d ! 2 


\uy/p\\i + o{(mh d ) i). 
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D.1.3 KDE error due to normalization 


The normalization error term can be easily derived based on the variance. 

Lemma 14 (Normalization error) Assume to^/p £ L 2 

1 /2 

E ||g(0) - Q m {9)\\i < (/ u\0)p(d) dO^j . 

Proof Denote oJi := w(0j), then E[w*] = f oj(6)p(6) dO = 1 and E[wf] = / lo 2 (9)p( 9) d9, for any i = 
1 ,m. Hence, 

m 

e| —V on - 1| 2 = - u 2 {d)p{o)dd. 

m ' m J 

2=1 

Recall that q{9) = YJiU^i K h{9,9i) and 0m(9) = ^ EHi ^i K h(9,9i). 


E\\q(0) ~ 0 mm\i 
1 


< E 


< E 


< E 


E m 

i=i 


1 - 


y ^UjK h (6,6i) - —y^ y u>jK h (Q,0i) 
2=1 1 2=1 

E m -1 

i =1 1 


1 - 


m 

E m 
2=1 


^ ' ^id^h {9, 
Ei=i^ 

II^WIIi 


Since ||-K)t||i = / jflK{9/h) d9 = / K(9)d9 = 1, we have 


E||g(0) 


Qm (0) || : < 



tu 2 {9)p(0)d9 



\\ w Vph 


D.1.4 KDE error in expectation and with high probability 

Based on the above there lemmas, namely, Lemma |12|-111[ we can immediately arrive at the bound of the 
L\ error in expectation as stated in Theorem [4j We now provide the proof for the high probability bound 
as stated below. 


Corollary 15 (Overall error in high probability) Besides the above assumption, let us also assume 
that u>{9) is bounded, i.e. there exists 0 < Bi ^ B 2 < 00 such that Bi ^ w(0) ^ B 2 ,\/9. Then, with 
probability at least 1 — S, 

\\q{9) - q{9 )||i < vChP + ^ hd/2 \\uy/p\\ x + -j=\\oo^/p\\ 2 + -^= log(l /S) + o((mh d )~^). 

Proof We use McDiarmid’s inequality to show that the function /(©) = ||g(0) — g(0)||i, defined on the 
random data 0 = (6b,..., 6 m ), is concentrated on the mean. Let 0 = (0i,... , 9j, ... , 0 m ). We denote 
w = (w(0i),... ,w(0 m )) and Co = (w(0i),... ,uo{9j ),... ,w(0 m )). Denote k = (K h (9,9 1 ),.. .,K h (9,9 m )) and 
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k = ( Kh{6 , di),... , Kh{9,9j), ..., Kh(9,9 m )). We first show that |/(©) — /(©') | is bounded. 

|/( 0 )-/(©') 

< IK?e(0)-«e(0)Hi 

E m 7 v—r in ~ 7 

m sr^m ~ 

i =1 Ei=l x 

Z (% - w i) • (E”= 1 w i fc i) - (E™ 1 Ui)(u)iki - Wjkj) 
(Er=i^)-(E”i^) 


Co j - ujj 


ojf kf ujjkj 

(E™i Ci) 

00 

\—\m ~ 

Ej=l 


‘2B\B2 2 B 1 B 2 ^ 4:B\B2 

m m ^ m 


Invoking the McDiamid’s inequality, we have 


Pr(/(©) — Eq [/(©)] ^ e) < exp 


2 

me 

sb 2 b 2 


,Ve > 0 


which implies the corollary. 


D.2 L 2 -Error Bound of Weighted Kernel Density Estimator 

Following same argument yields also similar L 2 -error bound of the weighted kernel density estimator, i.e. 
||q(0) — q(9) H 2 - For completeness and also for future reference, we provide the exact statement of the bound 
below in line with Theorem [f] and Corollary [Ts] 

Theorem 16 (L 2 -error in expectation) Let q = cop £ and K be a (/?; p,v)-valid density kernel. 

Assume that to 2 p £ L 2 and has bounded support. Then 

E ||«(0) - q(0)\\l < 2 [vh^Cf + ^j\\u}^p\\l + o((mh d )~ 1 ). 

Proof for Theorem \ 1 6\ The square L 2 -error can also be decomposed into three terms. 

E ||g(0) - q(9)\\ 2 2 < 4E \\q(9) - g m (9)|| 2 +4E \\g m {9) - E g m (9)\\ 2 2 +2 ||E g m (9) - q(9)\\ 2 2 

V -V-' v -V-' v -V-' 

normalization error sampling error (variance) approximation error (bias) 


This uses the inequality (a + b + c) 2 < 2a 2 + 46 2 + 4c 2 for any a, b , c. 
|E [g m (9)} - q(9 )| < L(9) J \K(z)\\hzf dz9. Hence, 


From Lemma 


12 


we already have 


||![M0)] - q(6 )111 < v 2 h 20 / L 2 (9)d9 < [vh^Cf 


From proof for Lemma 13 we have 

E || g m (9) - E[g m (0)]||* = jE\g m (9) - E[g m {9)}\ 2 d9 ^ J a 2 {9) d9 

« / ‘■’""’tg - "’’’ 1 + AS* M IK* + o((mh*)-') 


( 12 ) 
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(13) 

(14) 






































In addition, we have for the normalization error term, 


E||9(0)-0m(0)ll2< E 


< E 


1 - 




Elii <•»< 




(15) 


Combining equation (12) , (13) and (15), it follows that 


E\\q(6)-q(e)\\ 2 2 ^2(vh 0 jr) 2 + 


8 pL 2 

mh d 


W^VpWI +o((mh d ) x ). 


Corollary 17 (^ 2 -error in high probability) Besides the above assumption, let us also assume thatoj(9) 
is bounded, i.e. there exists 0 < B\ ^ B 2 < 00 such that B\ ^ u>(0) ^ . Then, with probability at least 

1 - 8 , 

\\q(d) - q(0 )||| Sj 2 (vh^C) 2 + ^C|| Uy/p\\l + o{{mh d )~ l ) + WB ^^ 2/i y/log(l/5). 

Proof for Theorem \ 1 7| Use McDiarmid’s inequality similar as proof for Corollary [15] ■ 


E Convergence Analysis for Density Approximation 

In this section, we consider the rate of convergence for the entire density measured by A'L-divergence. We 
start with the following lemma that show the renormalization does not effect the optimization in the sense 
of optimal, and we show the importance weight w*( 0 ) = cx p(~ 7 tgt( e )) eac h step are bounded under proper 
assumptions. Moreover, the error of the prox-mapping at each step incurred by the weighted density kernel 
density estimation is bounded. 

Lemma 18 Let f qtdO, q t = is a valid density on LI, then, q+ = qf, where qf := argmin 9g7 ,(Q^ F t (q; qt), 
Qt := argmin ?e7 , (Q) F t {q\q t ), and F t {q;q') := (q, 7 * 5) i2 + I<L(q\\q'). 

Proof for Lemma [7ff| The minima of prox-mapping is not effected by the renormalization. Indeed, such 
fact can be verified by comparing to q+ = argmin F t (q; qt) and qf = argmin F t (q;qt), respectively. 

^ _ {jz^dt) 1 ~ lt p{9) 1 t p(x t \9) N 'i t _ g, 1 ~ 7 t p(^)tP(^t| 6 | ) jV7f _ -+ 

Qt f (jb^qt) 1 ~' yt p(0)tP{x t \9) N ' tt dd j q 1 -^ t p{e) 1 t p{xt\o) N ^de 9t 


Due to the fact, we use q’f following for consistency. Although the algorithm updates based on q t , it is 
implicitly doing renoramlization after each update. We will show that qt+\ is an e-inexact prox-mapping. 

Lemma 19 Assume for all mini-batch of examples ||<7t(0)||^ o ^ M 2 , then we have 

(a) exp(-2y t M) < w t (0) = < exp( 27 t M), 

(b) ||VJl(g+;g t )lloo<37 t M. 

Proof for Lemma [iff] Let Z := J q t {9) exp(— r y t g t (9))d9. We have exp(—q t M) < Z < exp(y t M). 

(a) Since ||s , t (0)||^ o ^ M 2 , we have 

exp(-2 7 ,M) < .,(«) = A® = ?5E<=|Ufi) s exp( 27t M). 

Qt(9) ^ 
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I ~~r 

(b) Also, because VF t (g* ) = 7*3* + log ^ = 7*3* + log(w*), it immediately follows 

l|VF t (3+;§ t )|| 00 = 117 ( 3 * +log(w*)|| 00 < 7 tll 3 t||oo + || log(w*)|| 00 < 7 t M + (27 t M) = 3 t t M. 


Lemma 20 Let e* := F t (qt+i',qt ) — Ft(9t'i9t)> which implies qt+i £ -P| t *( 7 t 3 t)- Let the bandwidth at step t 
satisfies 

h t = O{l)m~ 1/[d+20 \ 

one can guarantee that 

2/3 0 

!*[*_!], </[*_!]] ^ 0(l){p 2 + v 2 £ 2 )p 2 Am t d+2e + 0(l)M(p + vC) lt m t d+2fi 
In addition, with probability at least 1 — 25 in O t \x[ t _i], @[ t ~i], we have 

, _ 2/3 ,_ 0 

e t < 0(l)(p, 2 \/log(l/5) + iy 2 C 2 )p?Am t d+2li + 0(l)M(p. + v£+ i/log(l/ 5 )) 7 *TO t d+2pi 
where 0(1) is some constant. 

Proof for Lemma\2i 


Note that since qj(6) = q t (6) exp(- 7 * 3 *( 0 ))/Z, where q t (0) = Yttttti a i K h t (° - 0 *)> and 9t(0) = log(<?*) - 
log(p) — iVlog(p(a;t|0)). By our assumption, we have 3 * £ C^(fl) and exp(— 7 * 3 *) £ C^(fl); hence, qt £ 
C^(Q). Invoking the definition of function F*(-; q t ), we have 

Ft(qt+ 1 ; 9t) - ^(3V + ; 3t) = KL(q t+1 \\q+) + (VF*(g+; §*), 3 *+i - 3 * + )l 2 

< KL(q t+1 \\qf) + 37*M||3 f + — 3t+i||i 
r ( qt+i -qt) 2 


< 


dO + 3j t M\\qt — 3t+i||i 
< A|jg t+ i - qt\\l + 37*M||g*+ - gt+ilK 


qt 


Based on the definition of g*+i, we have 

1 


qt — 9t+i||i — 


i-C 


qt +1 - qt 


= ^Il9*+i - ^t + C<7t + I|i < - 9t~l|i 


— Il9t+i — 9t"l|i + C + °(C + Il'It+i _ 9t + i|i)- 


Similarly, 


1-C 


|9 t + -9t+i||2 = 


< 


j (qt +1 - 9* + ) + 1 5 


Hlft+i -3* H 2 


2 C 2 




(i-C) 2 "" + (i-C) 2 

< 2(1 + C) 2 ||9t+i - 9t + lll + 2C 2 ||^ + lll + o(C 2 ||3 t + ll2 + C 2 ||3t+1 - qt 111) 

Recall C = 1 — Jq 3t+i = ( 1 , 3* + — 9 t+i) ^ ||g*+i — 3+||i, we can simplify the L\ and L2 error as 


|3t+i-3 t + ||i = 2\\q t+1 -qt\\i+o(\\q t+1 -qt\\i), 

-3 t + ||2 + 2 ll3t + ll2ll3t+i -9, 

< (2 + 2||g+|| 2 )||3 t+1 - 3+|| 2 + o(\\q t+1 - 3+1| 2 ). 


t - 9t+i\\l ^ 2||3* + i - qt ||1 + 2||3 t + || 2 ||3 t+1 - 3+|| 2 + o(||g* + i - 3 +H 2 + ||3 < f || 2 |l9t+i ~ 9t + lli) 
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The last inequality for L 2 error comes from Jensen’s inequality. We argue that \\qt\\\ is finite. Indeed, 

9? exp(- 27 t5t ) exp(—2y t g t ) 


5+II 2 


= Uot YdO = 


Z 2 


-dd < 


Z 2 


qfdd 


< exp(47 t M) f Y] ala] J K h (9 - d z )K h (9 - 6 3 )dd 

< exp(47tM) ^^ a\a t j\\Kh{d — #j)|| 2 ||-K)i(0 — ^)|| 2 ) < exp(47 t M)^ 2 ||o; t ||i||Q! t || 00 < exp(47(M)lt 2 


hJ 


Therefore, we have 


e* < (2A + 2 A/r 2 exp( 47 t M))||? t+ i-g t + ||2 + 67 t M||g t+ i-g t + ||i 

+o(||g t+ i - Q t + || 2 +7t||9t+i - % + ||i) 


Applying the result of Theorem [ 4 ] and 16 for qt+i and qf we have 
E 0 [e t |o:[ t _i],0[ t _i]] < (2A + 2A^ 2 exp(47 t M)) 

T 


2 (^C) 2 + +o({m t hf) x ) 


+ 67 t M 


vCh 


* /— u d ! 2 

y/TTltht 


\Ut 


V&\\i 


y/rrh 


\ut\ftfth + o((m t hf) 2 ) 


+o^2(^/if £) 2 + ll^lli + 7 t[vCht + ^d /2 ll^iV^tlli + 


Under the Assumption C, we already proved that l^doo ^ exp(27 t M)i hence, \\u t \/qt || 2 ^ exp( 47 t M). 
Without loss of generality, we can assume f \Jq t {0)d0 ^ 0(1) and 7 t M < 0(1) for all f, then we can 

simply write ||wt\/5t||i ^ 0(1) and Ili ^ 0(1)- When h t = 0(l)m)T 1/ ^ d+2/3 \ the above result can be 

simplified as 


Eg[et\x[ t -i],d[ t - 


Similarly, combining the results of Corollary [15] and [lTJ we have with probability at least 1 — 25, 

2 {vhtCf + ^dW^tVftWl + o({m t h d )~ 1 ) + ^/\og(J/6) 


e t < 


(2A + 2Ap, 2 exp(4 7t M)) 

' Ch ‘ + 757 5 "" 1 ' 75 ' 11, + vb 11 "'^" 2 + cb 


+ 67 t M 

+0 


1 


2 (yh? t L) 2 + TO ^ci W^Wi +r Yt[v£hP + ^d /2 ll^t 111 + ~^f t 


\/8BiB 2 log(l/<5) + o({m t ht) *) 

Wtyfifth]) 


which leads to the lemma. 


Our main Theorem [6] follows immediately by applying the results in the above lemma to Theorem [2j 
Proof of Theorem We first notice that 


E[KL(q*\\q T )\ 


[ q* log -—dO 

= E 

f q* log J —+ 

f q* log ^dd 

J qr 


[J qr J 

qr 


E[A7L( 9 *]|g T )]+E 


q* log Z-dd 
qr 
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For the second term, 


E 

1 q* log ^-d6 

= E 

/ * 1 

(q ,iog : ) 


L J 9t J 


qr 


= E[<g*,-log(l-Cr)] 

= E[-log(l - Ct)] < Ct + o(Ct) ^E||g T -g+_ 1 || 1 +o(E||q : T -g+_ 1 || 1 ) 
By Theorem [ 4 ] and setting h t = 0(l)m t 1 ^ d+2/3 ^ j we achieve the error bound 


E 


q* log ^de 
qr 


^ C2m t 


0 

d+2/3 


where C 2 := + vC). 

When setting = minj^yj, ^ a/^d+ 2 ^) } invoking the above lemma, we have 

E e [ct|* [t _i],0[ t _i]] < C im ; 2f<nd+2p) , 

where C\ := 0(l)(/r + vC) 2 p 2 A. Expanding the result from Theorem [ 2 ] it follows that 

E x , e [KL(q*\\q t+1 )\ < (1 - lt )¥. x>e [KL(q*\\q t )] + Kd+20) + |w/ 2 

The above recursion leads to the convergence result for the second term, 


E[KL(q*\\q T )} < 


2max{Di,M 2 } E*Li t 2 " 


2/3 

d+2/3 


T 


+ Ci 


2 


Combine these two results, we achieve the desired result 


E[KL(q*\\q T )] < 2max{J>i,M 2 } +c EL 1 t 2 m t d+2P d+ E 


T 


y 2 


Remark. The convergence in terms of ATL-divergence is measuring the entire density and much more 
stringent compared to integral approximation. For the last iterate, an overall O(^) convergence rate can be 
achieved when rn t = 0(t 2+d /&). Similar to Lemma[9j with Pinsker’s inequality, we could easily obtain the the 
rate of convergence in terms of integral approximation from Theorem [6] After T steps, in general cases, the 
PMD algorithm converges in terms of integral approximation in rate 0(1/ y/T) by choosing 0(l/t)-decaying 
stepsizes and 0(t 2+5 ? )-growing samples. 


F Derivation Details for Sparse Gaussian Processes and Latent 
Dirichlet Allocation 

We apply the Particle Mirror Descent algorithm to sparse Gaussian processes and latent Dirichlet allocation. 
For these two models, we decompose the latent variables and incorporate the structure of posterior into the 
algorithm. The derivation details are presented below. 

F.l Sparse Gaussian Processes 

Given data X = {xj}” =1 , x.-, £ R dxl and y = {y,;}/ =1 . The sparse GP introduce a set of inducing variables, 

Z = l) £ R dxl and the model is specified as 

p(y n \u,Z) = AT (y„IA+ m Af“^u, K) 
p{u\Z) = A/"(u|0, K mm ). 
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where K mm = [k(zi, K nm = [k(x it ^•)]i=i,..., n; j=i,... )TO - For different if, there are different 

sparse approximations for GPs. Please refer jSF] for details. We test algorithms on the sparse GP model 
with K = /3 - 1 I. We modify the stochastic variational inference for Gaussian processes [TH1J for this model. 
We also apply our algorithm on the same model. However, it should be noticed that our algorithm could be 
easily extended to other sparse approximations [551 . 

We treat the inducing variables as the latent variables with uniform prior in sparse Gaussian processes. 
Then, the posterior of Z , u could be thought as the solution to the optimization problem 


mm 
9(2,u) , 


q(Z , u) log udz ~ / Q( z > u) \ogp(yi\xi, u, Z)dudZ 


The stochastic gradient of Eq.(16) w.r.t. q(Z, u) will be 


9 {q{Z , u)) = ^ log q(Z, u) - - \ogp{Z)p(u) - logp(yi\xi, u, Z) 


and therefore, the prox-mapping in t-step is 

q(z, u 


min [ q(Z, u)log— - u ) -— 7 —udZ — 7 * [ q(Z,u)logp(yAxi,u,Z)dudZ 


q(Z. 

which could be re-written as 


q(z) 


+ /,(«|Z)1- 9(u|z) 


g t (u|Z) 1 -Tt/«p(u|Z)T'‘A 


- It logp(yi\xi,u, Z) 


du >dZ 


L(q(u\Z)) 


(16) 


We update g t+ i(u|Z) to be the optimal of L(q(u\Z)) as 

q t+ i{u\Z) oc q t (u\Z) 1 ~' yt/n p(u\Zy t/n p(y i \x i ,u,Zy t 

= Af(u\m t , 5 t “ 1 ) 1_7t/n A/ r (u|0, K mm )^ n Af( yi \K im K-^u, r) 7t 
= Af (u|m t+ i, 

where T — diay^Ka Qa') -t- (d I , Qa — A^ m A mm AT m 2 , 

dt+i = (1 — 'Yt/ n )&t + r )t/ n K rnm + r )tKimK rnm T K mm K mi 
m t +1 = 5^ ( (1 - 7 tln)di X m t + 7 t/nK^mo + 7 t A'^A' rni T" 1 ?/) 


Plug this into the L(q(u\Z)), we have 

q(u\Z) 


L(q(u\Z)) = J q{u\Z) 


log 


q t {u\Z) l -^/ n p{vL\Z)^/ r 


- 7tlog p(yi\xi,u,Z) 


d u = - log p(yi\xi,Z) 


where 


p{Vi\xi,Z) = Jq t{ u\Z)' 7 t/n p(u\Z) r/t/n p(y i \x i ,u,Z)' ft du 

= J W(u|m t , <5 t _1 ) 1_7t/ "A/’(u|0, K mrn y t ^ n N{yi\K irn K^ m n, T) 7t du 
= AT(yi\K irn K mrn c, E) 
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where 


£t+i 

c 


r-l 

'-mm 


(1 -'y t /n)S t +'y t /nK, 

K+i (i l - 7 t/n)S t m t + jt/nK-^mo 


£ = 


K K~ l A -1 K~ x K -I_F 

7 1 


Solve 


min / log - 

q(z)J < 


?(^) 


-dZ - J q(Z) \ogp(yi\xi, Z)dZ 


q t {ZY-y^p{Z)^ 

will result the update rule for q(Z), 

qt+i(Z) cx q t {Z) 1 -*/ n p(Z)'K/ n p(y i \x i ,Z) 

We approximate the q(Z) with particles, i.e., q(Z) = Ej = 1 w : ’S(Z : >). The update rule for w : > is 

w t. ex p(— 7 t/nlog(tU() + 7 t/n logp(Z J ) +\ogp{yi\xi, Z^)) 


wi. i = 


Ej w t exp(- 7 t/nlog(u;^) + 'y t /n\ogp{Zi) + logp( 2 /i|£i, £•?)) 

F.2 Latent Dirichlet Allocations 

In LDA, the topics $ £ K a:xM ' are K distributions on the words W in the text corpora. The text corpora 
contains D documents, the length of the d-th document is N d - The document is modeled by a mixture 


of topics, with the mixing proportion 


pi xl< 


. The words generating process for X d is following: first 


drawing a topic assignment Zd n , which is 1-by-K indicator vector, ii.d.fronr 9d for word Xd n which is 1- 
by-W indicator vector, and then drawing the word Xdn from the corresponding topic $ Zdn - We denote 


Z d — { z dn\ n L i € 

probability is 


xK , X d = {x dn }n=i e K JV<iXM/ and X = {x d } d =i,Z = {Z d } d=1 . Specifically, the joint 

(17) 


p(x d ,z d ,9 d ,$) = p{xd\z d ,$)p(z d \9 d )p(9d)p($) 

N d W I< 

p(x d \z d ,$) = nnn*; 

n—lw—1k—1 
N d K 

p(z d \9d) = nn 


fyZdnkK dn 

~ kw 


QZdnk 

7 dk 


n—lk—1 

_ £ (Ka) 


The p($) and p(9) are the priors for parameters, p(9 d |a) = ^pr Ilf 9 d k 1 and P( $ IA)) = life r[p 0 )w TIL ~* w k > 
both are Dirichlet distributions. 

We incorporate the special structure into the proposed algorithm. Instead of modeling the p($) solely, 
we model the Z = {Z}^ =1 and $ together as q(Z, $). Based on the model, given Z , the q(<b\Z) will be 
Dirichlet distribution and could be obtained in closed-form. 

The posterior of Z , $ is the solution to 

7S.5 / q{z ’ log mzffm izdit ~ 5 £./ i(z.*)'o<sp(x^,*)dzd4> 

We approximate the hnite summation by expectation, then the objective function becomes 

q{Z,$) 


_ T(WPo) yjW 


min [ q(Z, <E>) log 
q(Z^) U J 


p(Z\a)p{$>\/3) 


d,Zd$> — lb 


q{Z,$) log p{x d \z d , $)dZd$ 


(18) 
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We approximate the q(Z) ~ YiL 1 by particles, and therefore, q(Z, $) ss YhL i tu l P($|Z’ 1 ) where 

P($|.Z*) is the Dirichlet distribution as we discussed. It should be noticed that from the objective function, 
we do not need to instantiate the z d until we visit the Xd■ By this property, we could first construct the 
particles {Z l }^ =1 ‘conceptually’ and assign the value to 1 when we need it. The gradient of Eq.(18) 

w.r.t. q(< E>, Z) is 

g(q(Z,$)) = -^log q(Z,$) - log p(®)p(Z) - E x [logp(x d \<S>, z d )] 

Then, the SGD prox-mapping is 

min / q(Z, <l>) log f q(Z,$) log q t (Z, $)/£> - \ogp($)p(Z)/D - logp(a; d |<f>, z d ) dZd<S> 

qt{z,9) J L 

We rearrange the prox-mapping, 

q(Z)q($\Z) 


mm 

q(Z)q(*\Z) 


q(Z)q($\Z) log 


7 t J q(Z)q(*\z) 


q t (Zy-^/ D q t ^\Zy-yt/D 

log p($)p(Z)/D + logp(x d \$, z d ) 


dZd<$> 


nun 

q(Z)q(*\Z) 


+ 


q(Z)\ log 


q{z) 


q t {Z) 1 -^/ D p (Z)^/ D 


q(*\Z) 


log gt ($| zy-*\ Z c P (^/ D ~ 7t logp(a;d|$ ’ 2d) ] M } dZ 


L(q(4>\Z)) 

The stochastic functional gradient update for q(Q\Z l ) is 

qt+iWZ*) oc qt^y-^/Hp^Y^pix^^dY 
Let q t ($>\Z l ) = Vir(f3l), then, the q t +i(^\Z l ) is also Dirichlet distribution 


qt+iWZ*) ocWr(^) l ^Kr(A) % IIII $ 


z > J2ri d $( z dnk = l,Xdnu> = l) 
' kw 


k w 


D 7t 


= Vir(f? t+ 1 ) 


where 7 1 = j t /D and 


N d 

[/3i-|-l]fctw = (1 'Yt)[{3t]kw T 'ytfdo T Djt ^ 8(z d nk = 1 ,%dnw — 1)- 

n 

In mini-batch setting, the updating will be 

j -j B N d 

[/bf+l]fei« = (1 — ')t)[Pt\kw + 'Ytfi 0 + EE d(z d nk — l,%dnw — 1)* 

d= 1 n 

Plug the g t+ i($|Z 8 ) into prox-mapping, we have 


L(q(*\Z)) = j q(<S>\Z) 

q(®\Z) 

= -log p(x d 

Zd, Z) 
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where p{x d \z d , Z l ) = q t (^\Z l ) x ^ ft p(^) :i/t p(x d \^ 1 z d ) D ^d^ which have closed-form 
p{x <i \z d ,Z i ) = [ q t {$\Z i ) 1 -^p{<S>)^p(x d \$,Zd) D ^d3> 

= f vir{pb x -^vir{piy'( nn$E" d5(zdnfc=i,a:<i ”” =i) ) 7t ^ 

TT / r (Er[^]fc^) V~^Y r ^^°) V t ^ >» ^( [ /? «+ 1 ] fc, ") 

L k L \n w rmk»)) \nPo) w J Wt + l\kw) 

and 

, w 

\ogp(x d \z d ,Z l ) oc Y f(i-7t)logr(^[^] fcu ,) + ^logr([/ 3 t I +1 ] feu ,) 

k ^ w w 

- logr^Pi+iU) - (1 - 7 t )E lo g r ([A 1 ]^)) 

w w ' 

Then, we could update qt{Z) = by 

q t +i{Z l ) oc q t {Z l )ex p ^ - ^log q t (Z l ) + ^\ogp(Z l \a) + \ogp(x d \z d , Z l )\ 

If we set a = 1, p(Z l ) will be uniformly distributed which has no effect to the update. For general setting, 
to compute logp(Z*|a), we need prefix all the {z l d }^ =1 . However, when D is huge, the second term will be 
small and we could ignore it approximately. 

Till now, we almost complete the algorithm except the how to assign z d when we visit x d . We could as¬ 
sign the z d randomly. However, considering the requirement for the z l d assignment that the q(z l d \Z^ d ) > 
0 , which means the assignment should be consistent, an better way is using the average or sampling 
proportional to / p(x d \$, z d )qt(®\Z t )p(z d \Z{_ id _ 1 )d® where p(z d \Z{ = f p{z d \a)p{a\Z{_ td _ x )da, or 

/ p(x d \$, z d )q t (®\Z l )p(z d \a)d$. 

G More Related Work 

Besides the most related two inference algorithms we discussed in Section ([5]), i.e., stochastic variational 
inference m and static sequential Monte Carlo 00, there are several other inference algorithms connect 
to the PMD from algorithm, stochastic approximation, or representation aspects, respectively. 

From algorithmic aspect, our algorithm scheme shares some similarities to annealed importance sam¬ 
pling (AIS) [ 3 D] in the sense that both algorithms are sampling from a series of densities and reweighting the 
samples to approximate the target distribution. The most important difference is the way to construct the 
intermediate densities. In AIS, the density at each iteration is a weighted product of the joint distribution of 
all the data and a fixed proposal distribution, while the densities in PMD are a weighted product of previous 
step solution and the stochastic functional gradient on partial data. Moreover, the choice of the temperature 
parameter (fractional power) in AIS is heuristic, while in our algorithm, we have a principle way to select 
the stepsize with quantitative analysis. The difference in intermediate densities results the sampling step in 
these two algorithms is also different: the AIS might need MCMC to generate samples from the intermediate 
densities, while we only samples from a KDE which is more efficient. These differences make our method 
could handle large-scale dataset while AIS cannot. 

Sequential Monte-Carlo sampler [IQ] provides a unified view of SMC in Bayesian inference by adopting 
different forward/backward kernels, including the variants proposed in |BJ[ 3 J as special cases. There are subtle 
and important differences between the PMD and the SMC samplers. In the SMC samplers, the introduced 
finite forward/backward Markov kernels are used to construct a distribution over the auxiliary variables. To 
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make the SMC samplers valid, it is required that the marginal distribution of the constructed density by 
integrating out the auxiliary variables must be the exact posterior. However, there is no such requirement 
in PMD. In fact, the PMD algorithm only approaches the posterior with controllable error by iterating the 
dataset many times. Therefore, although the proposed PMD and the SMC sampler bare some similarities 
operationally, they are essentially different algorithms. 

Stochastic approximation becomes a popular trick in extending the classic Bayesian inference methods to 
large-scale datasets recently. Besides stochastic variational inference, which incorporates stochastic gradient 
descent into variational inference, the stochastic gradient Langevin dynamics (SGLD) Welling and Teh j42l . 
and its derivatives [21 0 HI] combine ideas from stochastic optimization and Hamiltonian Monte Carlo 
sampling. Although both PMD and the SGLD use the stochastic gradient information to guide next step 
sampling, the optimization variable in these two algorithms are different which results the completely different 
updates and properties. In PMD, we directly update the density utilizing functional gradient in density 
space , while the SGLD perturbs the stochastic gradient in parameter space. Because of the difference in 
optimization variables, the mechanism of these algorithms are totally different. The SGLD generates a 
trajectory of dependent samples whose stationary distribution approximates the posterior, the PMD keeps 
an approximation of the posterior represented by independent particles or their weighted kernel density 
estimator. In fact, their different properties we discussed in Table [T] solely due to this essential difference. 

A number of generalized variational inference approaches are proposed trying to relax the constraints 
on the density space with flexible densities. Nonparametric density family is a natural choicc0 [37] and 
mm extend the belief propagation algorithm with nonparametric models by kernel embedding and particle 
approximation, respectively. The most important difference between these algorithms and PMD is that they 
originate from different sources and are designed for different settings. Both the kernel BP Song et al. m 
and particle BP Ihler and McAllester eu, Lienart et al. [25] are based on belief propagation optimizing local 
objective and designed for the problem with one sample X in which observations are highly dependent, while 
the PMD is optimizing the global objective, therefore, more similar to mean-field inference, for the inference 
problems with many i.i.d. samples. 

After the comprehensive review about the similarities and differences between PMD and the existing re¬ 
lated approximate Bayesian inference methods from algorithm, stochastic approximation and representation 
perspectives, we can see the position of the proposed PMD clearly. The PMD connects variation inference 
and Monte Carlo approximation, which seem two orthogonal paradigms in approximate Bayesian inference, 
and achieves a balance in trade-off between efficiency, flexibility and provability. 


H Experiments Details 

H.l Mixture Models 

We use the normalized Gaussian kernel in this experiment. For one-pass SMC, we use the suggested kernel 
bandwidth in [3]. For our method, since we increase the samples, the kernel bandwidth is shrunk in rate of 
0 (m _ 5) as the theorem suggested. The batch size for stochastic algorithms and one-pass SMC is set to be 
10. The total number of particles for the Monte Carlo based competitors, i.e., SMC, SGD Langevin, Gibbs 
sampling, and our method is 1500 in total. We also keep 1500 Gaussian components in SGD NPV. The 
burn-in period for Gibbs sampling and stochastic Langevin dynamics are 50 and 1000 respectively. 

The visualization of 10 runs average posteriors obtained by the alternative methods are plotted in Figure[3] 
From these figures, we could have a direct understand about the behaviors for each competitors. The Gibbs 
sampling and stochastic gradient Langevin dynamics sampling stuck in one local mode in each run. Gibbs 
sampler could fit one of the contour quite well, better than the stochastic Langevin dynamics. It should 
be noticed that this is the average solution, the two contours in the result of stochastic gradient Langevin 
dynamics did not mean it finds both modes simultaneously. The one-pass sequential Monte Carlo and 
stochastic nonparametric variational inference are able to location multiple modes. However, their shapes 

2 Although 1381 1151 named their methods as “nonparametric” belief propagation and “nonparametric” variational inference, 
they indeed use mixture of Gaussians, which is still a parametric model. 
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Figure 3: Visualization of posteriors of mixture model on synthetic dataset obtained by several inference 
methods. 


are not as good as ours. Because of the multiple modes and the highly dependent variables in posterior, the 
stochastic variational inference fails to converge to the correct modes. 

To compare these different kinds of algorithms in a fair way, we evaluate their performances using 
total variation and cross entropy of the solution against the true potential functions versus the number of 
observations visited. In order to evaluate the total variation and the cross entropy between the true posterior 
and the estimated one, we use both kernel density estimation and Gaussian estimation to approximate the 
posterior density and report the better one for Gibbs sampling and stochastic Langevin dynamics. The 
kernel bandwidth is set to be 0.1 times the median of pairwise distances between data points (median trick). 

In Figure |T])3) (4), the one-pass SMC performs similar to our algorithm at beginning. However, it cannot 
utilize the dataset effectively, therefore, it stopped with high error. It should be noticed that the one-pass 
SMC starts with more particles while our algorithm only requires the same number of particles at final stage. 
The reason that Gibbs sampling and the stochastic gradient Langevin dynamics perform worse is that they 
stuck in one mode. It is reasonable that Gibbs sampling fits the single mode better than stochastic gradient 
Langevin dynamics since it generates one new sample by scanning the whole dataset. For the stochastic 
nonparametric variational inference, it could locate both modes, however, it optimizes a non-convex objective 
which makes its variance much larger than our algorithm. The stochastic variational inference fails because 
of the highly dependent variables and multimodality in posterior. 


H.2 Bayesian Logistic Regression 


The likelihood function is 


p(y\x,w) 


1 

1 + exp(— yw T x) 


with w as the latent variables. We use Gaussian prior for w with identity covariance matrix. 
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We first reduce the dimension to 50 by PCA. The batch size is set to be 100 and the step size is set to be 
100+%/! ' We S * 0 P s t oc l ias ti c algorithms after they pass through the whole dataset 5 times. The burn-in 
period for stochastic Langevin dynamic is set to be 1000. We rerun the experiments 10 times. 

Although the stochastic variant of nonparametric variational inference performs comparable to our algo¬ 
rithm with fewer components, its speed is bottleneck when applied to large-scale problems. The gain from 
using stochastic gradient is dragged down by using L-BFGS to optimize the second-order approximation of 
the evidence lower bound. 

H.3 Sparse Gaussian Processes 

H.3.1 ID Synthetic Dataset 



(8)Posterior Convergence 


Figure 4: Visualization of posterior prediction distribution. The red curve is the mean function and the pale 
red region is the variance of the posterior. The cyan curve the ground truth. The last one shows convergence 
of the posterior mean to the ground truth. 

We test the proposed algorithm on ID synthetic data. The data are generated by 

y = 3x 2 + (sin(3.537rx) + cos(7.7nx)) exp(— 1.67t|x|) + O.le 

where x £ [—0.5, 0.5] and e ~ A/”(0,1). The dataset contains 2048 observations which is small enough to run 
the exact GP regression. We use Gaussian RBF kernel in Gaussian processes and sparse Gaussian processes. 
Since we are comparing different inference algorithms on the same model, we use the same hyperparameters 
for all the inference algorithms. We set the kernel bandwidth a to be 0.1 times the median of pairwise 
distances between data points (median trick), and /3 _1 = 0.001. We set the stepsize in the form of for 

both PMD and SVI and the batch size to be 128. Figure. [4] illustrates the evolving of the posterior provided 
by PMD with 16 particles and 128 inducing variables when the algorithms visit more and more data. To 
illustrate the convergence of the posterior provided by PMD, we initialize the u = 0 in PMD. Later, we will 
see we could make the samples in PMD more efficient. 

H.3.2 Music Year Prediction 

We randomly selected 463, 715 songs to train the model and test on 5,163 songs. As in [5j, the year values 
are linearly mapped into [0,1]. The data is standardized before regression. Gaussian RBF kernel is used in 
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the model. Since we are comparing the inference algorithms, for fairness, we fixed the model parameters for 
all the inference algorithms, ie., the kernel bandwidth is set to be the median of pairwise distances between 
data points and the observations precision /3 _1 = 0.01. We set the number of inducing inputs to be 2 10 and 
batch size to be 512. The stepsize for both PMD and SVI are in the form of —3-7=. To demonstrate the 

n 0 +Vt 

advantages of PMD comparing to SMC, we initialize PMD with prior while SMC with the SoD solution. 
We rerun the experiments 10 times. We use both 16 particles in SMC and PMD. We stop the stochastic 
algorithms after they pass through the whole dataset 2 times. 

H.4 Latent Dirichlet Allocation 



Figure 5: Several topics learnd by LDA with PMD 


We fix the hyper-parameter a = 0.1, /? = 0.01, and K = 100. The batchsize is set to be 100. We use 
stepsize for PMD, stochastic variational inference and stochastic Riemannian Langevin dynamic. For 

each algorithm a grid-search was run on step-size parameters and the best performance is reported. We stop 
the stochastic algorithms after they pass through the whole dataset 5 times. 

The log-perplexity was estimated using the methods discussed in 22 on a separate holdout set with 1000 
documents. For a document x d in holdout set, the perplexity is computed by 


where 


perp(a;d|A, a, /?) = exp 


En=i lo gP(srfn|A,a,/3) 

N d 


p(x d n\X,a,/3) = E @di$ 



(19) 


We separate the documents in testing set into two non-overlapped parts, ^estimation an d ^evaluation. yy e 
first evaluate the 9 d based on the ^estimation. p or diff eren t inference methods, we use the corresponding strate¬ 
gies in learning algorithm to obtain the distribution of 9 d based on ^estimation. yy e eva j ua t e p(x dn \X,a, (3) 
on ^evaluation the obtained distribution of 9 d . Specifically, 


P (xZ aluation \X,a,(3) = E w E eT ^ t , 


1 I <J>,C 
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For PMD, SMC and stochastic Langevin dynamics, 


/^evaluation 

Vdk 


/ jn—l 


S(K 


estimation 

dnk 


jy estimation 


Ka 


1 ) + a 


For stochastic variational inference, q(0d) is updated as in the learning procedure. 
We illustrate several topics learned by LDA with our algorithm in FigurejSj 
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